Cars4U - Price predictions on used cars

Objective

To explore and visualize the dataset, build a linear regression model to predict the prices of used cars, and generate a set of insights and recommendations that will help the business.

To preprocess the raw data, analyze it, and build a linear regression model to predict the price of used cars.

Key Questions

Data Description

The data contains the different attributes of used cars sold in different locations. The detailed data dictionary is given below.

Each record in the dataset describes a car and its price. Some records also include the current price of the same car when bought new.

Data Dictionary

Attribute Information (in order):

Flow & Approach

To understand the Cars4U data and provide both a model that predicts used-car prices and business recommendations, we will follow the steps below.

Import necessary python libraries

Loading and exploring the data

In this section the goal is to load the data and then check its properties: size and data types.

Read data from the CSV file and load it into a pandas DataFrame

Read the CSV file and load the data into a DataFrame
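A minimal sketch of the loading step. The three-row sample and the file name mentioned in the comment are hypothetical stand-ins for the real Cars4U file, which is not shown here.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real CSV file.
csv_text = """Name,Year,Kilometers_Driven,Price
Maruti Wagon R LXI CNG,2010,72000,1.75
Hyundai Creta 1.6 CRDi SX Option,2015,41000,12.5
Honda Jazz V,2011,46000,4.5
"""

# pd.read_csv accepts a path or any file-like object. With the real data this
# would be a path instead, e.g. pd.read_csv("used_cars_data.csv") (assumed name).
cars = pd.read_csv(io.StringIO(csv_text))

print(cars.shape)
print(cars.dtypes)
```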

Common Methods on DataFrame

Let's create common methods that are used multiple times during the initial checks and the data preprocessing

Checking Data

Check the first few rows of the cars data, and check out its info() and describe() methods.

Check 10 random rows to see how the data looks
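These first-look checks might be sketched as follows. The small frame is made-up sample data, and the variable name `cars` is an assumption.

```python
import pandas as pd

# Made-up sample; the real frame comes from the CSV file.
cars = pd.DataFrame({
    "Name": ["Maruti Wagon R LXI CNG", "Hyundai Creta 1.6 CRDi SX Option", "Honda Jazz V"],
    "Year": [2010, 2015, 2011],
    "Kilometers_Driven": [72000, 41000, 46000],
    "Price": [1.75, 12.5, None],
})

# sample(n) returns n random rows; cap n at the frame length for small data.
print(cars.sample(n=min(10, len(cars)), random_state=1))

n_rows, n_cols = cars.shape   # number of rows and columns
cars.info()                   # dtypes and non-null counts per column
summary = cars.describe()     # count, mean, std, quartiles for numeric columns
print(summary)
```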

Data Shape

How many rows and columns are in the input data?

observations

Check Data Types

Let's check all data types and non-null value counts

observations

Use describe() to check the statistics of the numerical columns

observation on describe

Checking Null Counts

Let's check which columns have null values and how many
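One way to produce a per-column null report, sketched on made-up values. The `report` layout is an illustration, not necessarily the notebook's own output.

```python
import numpy as np
import pandas as pd

# Made-up sample with a few missing values.
cars = pd.DataFrame({
    "Engine": [998.0, np.nan, 1199.0, np.nan],
    "Power": [58.16, 126.2, np.nan, 88.7],
    "Seats": [5.0, 5.0, 5.0, 5.0],
})

null_counts = cars.isnull().sum()                 # missing values per column
null_pct = (cars.isnull().mean() * 100).round(2)  # same, as a percentage

# Keep only columns that actually have missing values, worst first.
report = pd.DataFrame({"missing": null_counts, "percent": null_pct})
report = report[report["missing"] > 0].sort_values("missing", ascending=False)
print(report)
```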

observations

Check for Duplicates

Let's check for any duplicate values
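A short sketch of the duplicate check on made-up rows. Whether duplicates should be dropped is a judgment call that depends on the real data.

```python
import pandas as pd

# Made-up sample in which the first two rows are identical.
cars = pd.DataFrame({
    "Name": ["Honda Jazz V", "Honda Jazz V", "Audi A4"],
    "Year": [2011, 2011, 2013],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(cars.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = cars.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```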

observations

Top column values & observations

Let's check the value counts of the category columns and the top 10 values for each column

checking values for Name, year, location, fuel type, transmission type, owner types

observations

observations

Data preprocessing with Exploratory Data Analysis & Insights

Before visualizing the data, let's fix it

Processing columns - Column Conversions

Make columns category type

Let's explore the data!

check data types for possible column type conversions

Possible category columns; converting them saves memory
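The memory saving from the category type can be checked directly. In this sketch the column names `Location` and `Fuel_Type` and the repeated values are assumptions for illustration.

```python
import pandas as pd

# Made-up columns with few distinct values repeated many times.
cars = pd.DataFrame({
    "Location": ["Mumbai", "Pune", "Mumbai", "Chennai"] * 250,
    "Fuel_Type": ["Petrol", "Diesel", "Diesel", "CNG"] * 250,
})

before = cars.memory_usage(deep=True).sum()

# A category column stores each distinct value once plus small integer codes.
for col in ["Location", "Fuel_Type"]:
    cars[col] = cars[col].astype("category")

after = cars.memory_usage(deep=True).sum()
print(f"memory before: {before} bytes, after: {after} bytes")
```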

observations

Processing columns - Data Conversions

Convert to numerical type columns
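A sketch of one way to turn text columns into numbers. It assumes the raw values carry a unit suffix (for example "998 CC" or "58.16 bhp"); the actual raw format is not shown in this document, so treat the sample values and the `to_number` helper as hypothetical.

```python
import pandas as pd

# Hypothetical raw values: a number followed by a unit, or a placeholder.
cars = pd.DataFrame({
    "Mileage": ["26.6 km/kg", "19.67 kmpl", None],
    "Engine": ["998 CC", "1582 CC", None],
    "Power": ["58.16 bhp", "126.2 bhp", "null bhp"],
})

def to_number(series: pd.Series) -> pd.Series:
    """Extract the leading number from each value; anything else becomes NaN."""
    token = series.str.extract(r"^\s*(\d+(?:\.\d+)?)", expand=False)
    return pd.to_numeric(token, errors="coerce")

for col in ["Mileage", "Engine", "Power"]:
    cars[col] = to_number(cars[col])

print(cars.dtypes)
```

Values without a leading number, such as "null bhp", end up as NaN and can be handled later in the missing-value treatment.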

observations

Convert Price and New Price columns

observations

Feature Engineering - Make new Features

Extract Make & Model from Name

From the Name column we can determine the Make and Model
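A minimal sketch of the extraction, assuming the first word of Name is the make and the second word is the model. That rule is an assumption; it mislabels multi-word makes or models (a two-word make would be split), so the real notebook may use a more careful rule.

```python
import pandas as pd

# Made-up names following a "<make> <model> <variant...>" pattern.
cars = pd.DataFrame({
    "Name": [
        "Maruti Wagon R LXI CNG",
        "Hyundai Creta 1.6 CRDi SX Option",
        "Honda Jazz V",
    ]
})

parts = cars["Name"].str.split()   # list of words per row
cars["Make"] = parts.str[0]        # first word, assumed to be the make
cars["Model"] = parts.str[1]       # second word, assumed to be the model

print(cars[["Make", "Model"]])
```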

observations

Determine Age of Car using Year

Let's determine the age of the used car as the current year minus the year the car was made

This will be a useful feature column for predicting the price
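The age calculation is a single subtraction. The reference year below is a hypothetical fixed value chosen so the result is reproducible; `pd.Timestamp.now().year` would give the actual current year.

```python
import pandas as pd

cars = pd.DataFrame({"Year": [2010, 2015, 2019]})

# Hypothetical reference year; fixed so reruns give the same ages.
reference_year = 2021

cars["Age_Of_Car"] = reference_year - cars["Year"]
print(cars)
```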

observations

Initial Univariate analysis

Univariate analysis helps to check data skewness and possible outliers and spread of the data.

Check how Price is distributed

Check histogram and boxplot for data spread, skewness and outliers
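A sketch of a reusable helper that draws the histogram and boxplot side by side. The function name `histogram_boxplot` and the synthetic right-skewed data are assumptions for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

def histogram_boxplot(values, title=""):
    """Draw a histogram (left) and a boxplot (right) of the same values."""
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))
    ax_hist.hist(values, bins=30)
    # Mean vs median lines make skewness visible at a glance.
    ax_hist.axvline(np.mean(values), color="green", linestyle="--", label="mean")
    ax_hist.axvline(np.median(values), color="black", linestyle="-", label="median")
    ax_hist.legend()
    ax_box.boxplot(values)  # points beyond the whiskers are candidate outliers
    fig.suptitle(title)
    return fig

# Synthetic right-skewed data standing in for the Price column.
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=1.5, sigma=0.8, size=500)
fig = histogram_boxplot(prices, "Price")
```

In a notebook the `matplotlib.use("Agg")` line can be dropped so the figure renders inline.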

Observation on Price

Check how New_Price is distributed

Check histogram and boxplot for data spread, skewness and outliers

Observation on New_Price

Check how Kilometers_Driven is distributed

Check histogram and boxplot for data spread, skewness and outliers

Observation on Kilometers_Driven - And Fix Extreme Outlier

Check how Kilometers_Driven is distributed - after outlier treatment

Check histogram and boxplot for data spread, skewness and outliers

Observation on Kilometers_Driven

Missing value Treatment

Validate Data

Let's go through the columns one by one to see how many values are missing, then decide whether to fix those values or drop the NaN rows

Column & Row level missing data

Check at the column level which columns have missing values and how many

check on column level

check on row level

observations

check by missing count

Let's check whether we can find a pattern and fix the missing values

Check whether existing data has values for the missing fields

Check existing data by name and year to see whether we can get values for power, engine, and seats; the missing values are consistent across these columns

Fix missing values - Missing value Treatment

Fix Engine Values

Search the existing data for a car with the same model, make, and year that has a value; if one exists, use that value, otherwise fill with the median value
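This two-stage fill could be sketched with a group-wise transform. Using the group median as "the value from a similar car" is a simplification chosen for this sketch, and the sample rows are made up.

```python
import numpy as np
import pandas as pd

# Made-up rows: one Honda Jazz is missing Engine, and the only Audi A4 is too.
cars = pd.DataFrame({
    "Make": ["Honda", "Honda", "Honda", "Audi"],
    "Model": ["Jazz", "Jazz", "Jazz", "A4"],
    "Year": [2011, 2011, 2011, 2013],
    "Engine": [1199.0, np.nan, 1199.0, np.nan],
})

# Stage 1: fill from cars with the same make, model, and year.
group_median = cars.groupby(["Make", "Model", "Year"])["Engine"].transform("median")
cars["Engine"] = cars["Engine"].fillna(group_median)

# Stage 2: anything still missing (no similar car has a value) gets the overall median.
cars["Engine"] = cars["Engine"].fillna(cars["Engine"].median())

print(cars)
```

The same pattern applies to Power and Seats by swapping the column name.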

observation

Fix Power Values

Search the existing data for a car with the same model, make, and year that has a value; if one exists, use that value, otherwise fill with the median value

observation

Fix Seats Values

Search the existing data for a car with the same model, make, and year that has a value; if one exists, use that value, otherwise fill with the median value

observation

checking missing value counts by columns - after initial treatment

observation.

Fix Missing price values

Observation

Fix new_price

observation

Outlier Treatment - Fix Outlier(s)

Outlier detection using IQR

Let's find all outliers in the numerical columns
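A sketch of IQR-based detection using the conventional 1.5 × IQR fences. The multiplier and the sample values are assumptions; the notebook may use different fences.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up values; the last Kilometers_Driven entry is an extreme reading.
cars = pd.DataFrame({
    "Kilometers_Driven": [20000, 35000, 41000, 46000, 52000, 60000, 72000, 6500000],
    "Power": [58.0, 74.0, 82.0, 88.0, 103.0, 126.0, 140.0, 174.0],
})

outlier_counts = {col: int(iqr_outlier_mask(cars[col]).sum()) for col in cars.columns}
print(outlier_counts)
```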

observations on outliers

Data Distribution

Log transformation to fix skewness

Observation

Log transformation of Price, New_Price, and Kilometers_Driven made their distributions much less skewed
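The effect of the transformation can be checked by comparing skewness before and after. This sketch assumes the natural log (`np.log`) and strictly positive values; whether the notebook used `np.log` or `np.log1p` is not stated.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed prices (in lakhs), with one very expensive car.
cars = pd.DataFrame({"Price": [1.75, 4.5, 12.5, 17.74, 160.0]})

cars["price_log"] = np.log(cars["Price"])  # requires Price > 0

skew_before = cars["Price"].skew()
skew_after = cars["price_log"].skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Predictions made on the log scale can be mapped back to prices with `np.exp`.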

Check once for any missing values after all treatment(s)

Observations missing data

Drop missing values

delete all missing values

No missing values

Feature Engineering - Price Type

Create a price bucket from the price range, since car prices vary a lot and the majority of cars are below 20 lakhs

Let's create buckets called ECONOMY, MID-SCALE, and LUXURY
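A sketch of the bucketing with `pd.cut`. The cut points (10 and 20 lakhs) and the column name `Price_Type` are hypothetical; the document names the buckets but not their boundaries.

```python
import numpy as np
import pandas as pd

cars = pd.DataFrame({"Price": [1.75, 9.99, 12.5, 20.0, 45.0]})

# Hypothetical boundaries in lakhs: (0, 10], (10, 20], (20, inf).
bins = [0, 10, 20, np.inf]
labels = ["ECONOMY", "MID-SCALE", "LUXURY"]

# pd.cut uses right-closed intervals by default, so 20.0 falls in MID-SCALE.
cars["Price_Type"] = pd.cut(cars["Price"], bins=bins, labels=labels)
print(cars)
```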

Drop unwanted columns

Since Model and Make are derived from Name, and Age from Year, we can drop Name and Year

Data Preprocessing - Result

How does the data look after all treatments?

Exploratory Data Analysis - post Data pre-processing

Let's visualize the data to understand better

Univariate analysis & Bivariate analysis

Univariate Analysis

Let's generate a histplot and boxplot for all numeric features to understand how the data is spread and whether there are any outliers

Price - price_log

Observation on Price

New Price

Observation on New_Price

Age of Car (related with Year Column)

Observation on Age_Of_Car

Car Mileage

Observation on Mileage

Engine size

Observation on Engine

Power

Observation on Power

Observation on Seats

kilometers driven

Observation on Kilometers_Driven

univariate analysis on Categories

method to generate count plot

Counts by Make and Price Type

Observations

age of the car

Observations

location

Observations

Observations

transmission

Observations

owner type

Observations

seats

Observations

Bivariate analysis - Heat map & Pair Plot

observations on heatmap

We have to keep these in mind when building a linear model

pair plot

observations on pair plot

Bivariate analysis - feature vs price

Let's see how price is related to the critical numeric features

common method to generate joint plot

power vs price

observation on power vs price

engine vs price

observation on engine vs price

new_price vs price

observation on new price vs price

mileage vs price

observation on Mileage vs price

mileage vs power

observation on mileage vs power

engine vs power

observation on engine vs power

power vs new price

observation on power vs new price

engine vs new price

observation on engine vs new price

Bivariate analysis - Category vs price

Let's see how price is related to the critical categorical features

common method to generate count plot and box plot

observations on all category columns vs price

Model Building

Let's build different linear models and understand their performance and accuracy in predicting the price of used cars

Build Different models

Model 1 - using all features

Model 2 - using all features, deleting a few category types

Model 3 - using Model 2 features without new price and highly correlated power or engine

Model 4 - using only features highly correlated with price

Model 5 - using SequentialFeatureSelector to decide features

All these models will be built, analyzed, and scored in the same way using the same data

Model 1 - using all features

Define independent and dependent variables

Creating dummy variables

Dummy variables have to be created for all category columns so that they can be used as features
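Dummy-variable creation might look like the sketch below. Dropping the first level of each category (`drop_first=True`) is an assumption made here to avoid perfectly collinear columns in a linear model, and the column names are illustrative.

```python
import pandas as pd

# Made-up sample with two category columns and one numeric column.
cars = pd.DataFrame({
    "Fuel_Type": pd.Categorical(["Petrol", "Diesel", "CNG", "Diesel"]),
    "Transmission": pd.Categorical(["Manual", "Automatic", "Manual", "Manual"]),
    "Power": [58.16, 126.2, 88.7, 103.6],
})

# Each category level becomes a 0/1 column; the first level is the baseline.
X = pd.get_dummies(
    cars, columns=["Fuel_Type", "Transmission"], drop_first=True, dtype=int
)
print(X)
```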

observation post dummy variable creations

Split the data into train and test

We have to split the data into a training set and a test set; the model will be built using the training set and its performance evaluated on the test set
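A sketch of the split using scikit-learn. The 70/30 ratio, the random seed, and the synthetic data are assumptions; the document does not state the split used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and target standing in for the prepared car data.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=100)

# Hold out 30% for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print(X_train.shape, X_test.shape)
```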

Fitting a linear model

Performance evaluator - Model performance evaluation

This method computes R-squared, adjusted R-squared, and errors such as RMSE, MAE, and MSE; it can be used for different linear models
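A sketch of such an evaluator. The function name `model_performance` and the synthetic data are hypothetical; adjusted R-squared uses the standard formula 1 - (1 - R²)(n - 1)/(n - k - 1) for n rows and k features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def model_performance(model, X, y):
    """Return R2, adjusted R2, RMSE, MAE, and MSE for a fitted regressor."""
    pred = model.predict(X)
    n, k = X.shape
    r2 = r2_score(y, pred)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    mse = mean_squared_error(y, pred)
    return {
        "R2": r2,
        "Adj_R2": adj_r2,
        "RMSE": float(np.sqrt(mse)),
        "MAE": mean_absolute_error(y, pred),
        "MSE": mse,
    }

# Synthetic data: a linear signal plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 + X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)
scores = model_performance(model, X, y)
print(scores)
```

Calling the same function on the training set and on the test set gives the side-by-side comparison used for each model.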

Checking Model 1 performance

observation on model 1

Build Alternate Model - Model 2

Model 2 - All Features, less category

Define independent and dependent variables

Create Dummy Variables

Build train and test data, build linear model, fit and check performance

observation on model 2

Checking performance of Previous Models

Observation

Model 2 is doing better than Model 1

Model 3 - using Model 2 features without new price and highly correlated power or engine

Build Alternate Model with Less Features

observation on model 3

Checking performance of Previous Models

Model 4 - using Model 2 highly correlated features only

Build Alternate Model with Less Features

Price has a strong relationship with Engine, Power, New_Price, and the age of the car

Let's drop Mileage, Seats, and Kilometers_Driven

observation on model 4

Checking performance of Previous Models

Model 5 - Forward Feature Selection using SequentialFeatureSelector

Forward feature selection starts with an empty model and adds in variables one by one.
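A sketch of forward selection using scikit-learn's `SequentialFeatureSelector`. The notebook may have used a different implementation of the same name (mlxtend also provides one), and the synthetic data with two informative features is made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: only features 0 and 3 actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=300)

# Start from no features and add the one that most improves the
# cross-validated R2 at each step, until two features are selected.
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction="forward",
    scoring="r2",
    cv=5,
)
selector.fit(X, y)

selected = np.flatnonzero(selector.get_support())
print(selected)
```

With a named DataFrame, `X.columns[selector.get_support()]` lists the chosen feature names.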

Observation

Let's build a model with the top 28 features and see how the results look

Columns used for this Model

Observation

Model performance evaluation

Evaluating all 5 models on different performance metrics

Model Performance

Comparing all 5 models: each has an accuracy of 93% or above on both training and test data, and all models have almost the same error rate.

Model 2 has the highest accuracy scores and the lowest error scores of all the models

All models do well, but Model 5, which gives similar performance with far fewer features, might be easier to implement.

Let's visualize how closely Model 2 fits the real test data

Create a scatterplot of the real test values versus the predicted values.

Model 2 shows a good mapping of predicted vs real values

Residuals

Let's quickly explore the residuals to make sure everything was okay with our data.

Plot a histogram of the residuals and make sure it looks normally distributed. Use either seaborn histplot (distplot is deprecated), or just plt.hist().

We can see the residuals are distributed between -0.5 and 0.5
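A sketch of the residual check with `plt.hist()`. The data and model here are synthetic stand-ins, so the spread of these residuals says nothing about the real Model 2.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and log-price target.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([0.8, -0.3]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
model = LinearRegression().fit(X_train, y_train)

# Residual = actual minus predicted, computed on the held-out test set.
residuals = y_test - model.predict(X_test)

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(residuals, bins=20)
ax.set_xlabel("residual (actual - predicted)")
ax.set_ylabel("count")
```

A roughly symmetric, bell-shaped histogram centered near zero is what the normality check is looking for.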

Coefficients and Intercept of the best model

Key features that lead to a better price

Features impacting car prices negatively

observations

negative impacts

Formula to calculate price for any used car.

Automating the equation of the fit

To calculate the price of any future used car
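One way to automate the equation is to build it from the fitted coefficients. This sketch assumes the model was trained on the log of price, so predictions are mapped back with `np.exp`; the two features and their values are made up, and `predict_price` is a hypothetical helper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up features and an exactly linear log-price target.
feature_names = ["Power", "Age_Of_Car"]
X = np.array([[58.0, 11.0], [126.0, 6.0], [88.0, 10.0], [103.0, 4.0], [74.0, 8.0]])
y = 0.5 + 0.02 * X[:, 0] - 0.1 * X[:, 1]

model = LinearRegression().fit(X, y)

# Build "price_log = intercept + (coef) * feature + ..." from the fit.
terms = [f"({coef:.4f}) * {name}" for coef, name in zip(model.coef_, feature_names)]
equation = f"price_log = {model.intercept_:.4f} + " + " + ".join(terms)
print(equation)

def predict_price(row):
    """Apply the fitted equation to one car and undo the log transform."""
    log_price = model.intercept_ + float(np.dot(model.coef_, row))
    return float(np.exp(log_price))

print(predict_price([100.0, 5.0]))
```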

Conclusion

The best linear model we have is Model 2

Cars4U should focus on buying more

Cars4U should try not to buy